DS202W - W11+1 Summative

Author

31867

Part 1: Supervised Learning

Answering the questions

How would you create the dataset for this task?

To create the dataset, I took the following steps:

  • Creating the initial dataset: transforming categorical data into numerical form, feature engineering, and preparing the comment text for text analysis

  • Processing the comment text (text analysis): creating a corpus, building a document-feature matrix (DFM) using quanteda, and reducing dimensionality with LSA

  • Finalizing the dataset: preparing the dataset for the supervised learning model using data preprocessing techniques and standardization

Which technique(s) from the course would you use to address this research question?

To address the research question, I decided to use a random forest model:

  • Splitting the data into training and testing

  • Building the model: recipe, workflow, model specification

And how would you interpret the results?

To evaluate the results, I used the following metrics:

  • Confusion matrix, f_meas, ROC curve, ROC_AUC value

I have completed all the code for Part 1; please see below for a more detailed explanation of each step!

Setting up

Creating the initial dataframe

  1. Transforming the target variable ranking_type into a numerical binary variable.

I have also transformed the string binary variables in the post dataset into numerical binary variables. Although they were not used in building the model, I’d like to include this step for the purpose of consistency.

post_data$ranking_type <- ifelse(post_data$ranking_type == "top", 0, post_data$ranking_type)
post_data$ranking_type <- ifelse(post_data$ranking_type == "controversial" , 1, post_data$ranking_type)

post_data$over_18 <- ifelse(post_data$over_18 == "FALSE" , 0, post_data$over_18)
post_data$over_18 <- ifelse(post_data$over_18 == "TRUE" , 1, post_data$over_18)
post_data$is_original_content <- ifelse(post_data$is_original_content == "FALSE" , 0, post_data$is_original_content)
post_data$is_original_content <- ifelse(post_data$is_original_content == "TRUE" , 1, post_data$is_original_content)
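
The chained ifelse() calls above can be written more compactly: comparing a string column to one value yields a logical vector, which coerces directly to 0/1. A minimal sketch on toy data (the column names mirror the real dataset, but the values here are made up for illustration):

```r
# toy stand-in for post_data, for illustration only
post_data <- data.frame(
  ranking_type        = c("top", "controversial", "top"),
  over_18             = c("FALSE", "TRUE", "FALSE"),
  is_original_content = c("TRUE", "FALSE", "FALSE")
)

# each comparison gives TRUE/FALSE; as.integer() turns that into 1/0
post_data$ranking_type        <- as.integer(post_data$ranking_type == "controversial")
post_data$over_18             <- as.integer(post_data$over_18 == "TRUE")
post_data$is_original_content <- as.integer(post_data$is_original_content == "TRUE")

post_data$ranking_type  # 0 1 0 — "top" maps to 0, "controversial" to 1
```

This produces the same encoding as the ifelse() version in a single step per column.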
  2. Separating each of the post and comment datasets into two: one containing the string variables and the other the numerical variables. Removing rows where the permalink column is NA, keeping only the valid comment data.

Feature engineering: deciding which variables to include in my initial dataset. I selected the variables based on my domain knowledge.

  • For the post dataset, obviously I need to include ranking_type and post_id because they are the target variable and the unique identifier for the posts, respectively.

  • I included title and post_hint - they provide useful information on what the post is about and the type of post.

  • ups, upvote_ratio, score, subreddit_subscribers, num_comments - these numerical variables describe a Reddit post: the number of upvotes, the ratio of upvotes to downvotes, upvotes minus downvotes, the number of subscribers to the post’s subreddit, and the number of comments it received. The upvote and downvote variables could help determine whether a post is popular or disliked by users. A large number of subscribers to a post’s subreddit could mean the post is highly talked about, and likewise, a post that receives many comments could be popular among users.

  • For the comment dataset, I again need to include post_id, for the same reason as above.

  • I also included body - the actual content of the comments. You will see in the following section that I used the comment text as one of the features of my model. The rationale is that numerical variables about the posts can only tell us so much; they give no contextual information on users’ sentiment towards the post. For instance, a post can have many upvotes, receive many comments, and very few downvotes. Looking at these numbers alone, you might think it is a top post because users like it, talk about it, and rarely dislike it; yet it could still be a controversial post. Therefore, I thought a better feature to include would be the comment text. Moreover, the comment text could bring additional predictive power: perhaps certain words are strongly associated with either ‘top’ or ‘controversial’ posts.

#Post dataset 
post_data_string <- 
  post_data %>%
  select(ranking_type,post_id,title,post_hint) 
post_data_num <-
  post_data %>%
  select(ranking_type,post_id,ups,upvote_ratio,score,subreddit_subscribers,num_comments)

#Comment dataset 
comments_data_string <- 
  comments_data %>%
  filter(!is.na(permalink)) %>%
  select(post_id,body)  
  3. Joining post_data_string with comments_data_string on post_id. This allows me to combine the post information (such as the title, the type of post, and whether it’s top or controversial) with the comments associated with each post.

4. From the previous dataframe, extracting the post_id and comment text, combining all the comment text for each post into a single string, and filtering out any posts with no comments. The resulting dataframe has one row per post.

df_cleaned <- comments_data_string %>%
  select(post_id, body) %>% #taking only the comment texts and the post_id
  group_by(post_id) %>%
  summarise(combined_body = paste(body, collapse = " ")) %>%
  filter(combined_body != "NA")

Processing comment text (text analysis)

In this part, I will be performing text analysis on the comments data using the ‘quanteda’ package.

  1. Creating a corpus from the dataframe containing post_id and the combined comment text only.
corp_speeches_clean <- quanteda::corpus(df_cleaned ,text_field="combined_body")
  2. Assigning document names, so that each post in the corpus is associated with its corresponding post_id.
quanteda::docnames(corp_speeches_clean) <- df_cleaned$`post_id`
  3. Tokenization: splitting each text into individual tokens, then removing punctuation and English stopwords from the tokenized text using quanteda functions.
tokens_speeches <-
  quanteda::tokens(corp_speeches_clean, remove_punct = TRUE) %>%
  quanteda::tokens_remove(pattern = quanteda::stopwords("en")) 
  4. Creating n-grams from the tokenized text
tokens_speeches <- tokens_ngrams(tokens_speeches, n=1:2)
  5. Creating the document-feature matrix (DFM)
dfm_speeches <- quanteda::dfm(tokens_speeches) #building the DFM from the tokenized text
  6. Trimming the DFM to remove tokens that appear fewer than 5 times, mainly to save memory
dfm_speeches <- dfm_trim(dfm_speeches, min_termfreq = 5)
  7. Dimension reduction using Latent Semantic Analysis (LSA), reducing to 3 dimensions.

    I tried PCA but it failed due to a lack of computing power on my laptop; LSA is a more suitable alternative.

df_lsa <- quanteda.textmodels::textmodel_lsa(dfm_speeches %>% dfm_tfidf(), nd=3)$docs %>% as.data.frame()
head(df_lsa)
                  V1          V2          V3
123umlm 9.080579e-06 0.008412474 0.001035316
123xt60 5.042194e-05 0.007606036 0.001124379
123y5dz 4.244996e-06 0.005603121 0.001398179
12432l6 7.967513e-05 0.004659981 0.001098597
12433je 3.292173e-04 0.009375394 0.001385170
1243yci 1.301247e-04 0.016354701 0.009441097

Finalizing the dataset

  1. Combining post_id with the reduced-dimension dataframe; converting it to a dataframe; renaming the first column as post_id for consistency
df_word_frequency <- convert(dfm_speeches, to="data.frame")
combined_df <- cbind(df_word_frequency[,1], df_lsa) 
y_df <- as.data.frame(combined_df)
names(y_df)[1] <- "post_id"
  2. Combining the numerical variables from the post data with the reduced-dimension dataframe, effectively combining all my features into one dataset
y_df_1 <- post_data_num %>% 
          left_join(y_df, by= "post_id") 
  3. Getting rid of the NAs
data_clean <- na.omit(y_df_1)  
  4. Keeping only the numerical features for my final dataset
data_clean <- select(data_clean,-post_id)
  5. Turning the final dataset into a numerical dataframe
data_cleaned <- as.data.frame(lapply(data_clean, as.numeric))
  6. Using scale() to apply standardization, ensuring the data is standardized before it goes into the model; keeping the original structure of the dataset with the target variable in the first column
first_column <- data_cleaned[, 1]
data_cleaned[, -1] <- scale(data_cleaned[, -1])
data_cleaned[, 1] <- first_column

Building the random forest model

Reasons for choosing a supervised random forest model:

  • I knew that I wanted to use a random forest model over a single decision tree because a random forest aggregates the predictions of many decision trees, making it theoretically more robust than a decision tree alone. Random forests also reduce the risk of overfitting because each tree is trained on a random subset of features. I did try building a boosted decision tree model and found that its performance (comparing ROC_AUC values) was not as good as the random forest model’s.

  • The random forest model also has the advantage of handling mixed data types; this particular research question involves textual and numerical data. Random forests can also handle imbalanced data well.

  • The reason I did not use a support vector machine (SVM) is that it is less interpretable than either decision trees or random forests.

  1. Splitting the data into training and testing sets (70:30), using stratified sampling to enhance robustness
set.seed(123)
split <- initial_split(data_cleaned, prop = 0.7, strata = ranking_type)
training_data <- training(split)
testing_data <- testing(split)
  2. Transforming the target variable from numerical to factor form
training_data$ranking_type <- as.factor(training_data$ranking_type)
testing_data$ranking_type <- as.factor(testing_data$ranking_type)
  3. Defining the recipe, specifying that the target variable is ranking_type and that the training data is used; then defining the model specification; building the workflow that combines the recipe and the model specification; and lastly fitting the model on the training data.
tree_rec <-
  recipe(ranking_type ~ .,
         data = training_data) 

bagging_spec <- 
   rand_forest(mtry = .cols()) %>%
  set_engine("randomForest", importance = TRUE) %>%
  set_mode("classification")

wflow_forest <- 
  workflow() %>% 
  add_recipe(tree_rec) %>% 
  add_model(bagging_spec)

random_forest_model <- 
  wflow_forest %>% 
  fit(data = training_data)

Evaluating the random forest model

  1. Fitting the model to the training set. Evaluating the fitted random forest model using a confusion matrix & summary statistics

The confusion matrix looks stunning! Zeros for both the false positives and the false negatives mean that, based on the training data, the model did not incorrectly predict any positive or negative cases: it correctly identified every post as either top or controversial.

conf_mat_training <- random_forest_model %>% 
  augment(training_data) %>%
  conf_mat(truth = ranking_type, .pred_class) 

conf_mat_training %>%
  summary()
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         1    
 2 kap                  binary         1    
 3 sens                 binary         1    
 4 spec                 binary         1    
 5 ppv                  binary         1    
 6 npv                  binary         1    
 7 mcc                  binary         1    
 8 j_index              binary         1    
 9 bal_accuracy         binary         1    
10 detection_prevalence binary         0.550
11 precision            binary         1    
12 recall               binary         1    
13 f_meas               binary         1    
conf_mat_training %>%
  autoplot(type = "heatmap")

  2. Fitting the model to the testing set. Evaluating the random forest model on the testing data using a confusion matrix & summary statistics

The confusion matrix based on the testing data tells us that, overall, the random forest model performs well in predicting top and controversial Reddit posts. There are relatively few misclassifications: 1 false positive means the model misidentified 1 controversial post as a top post, and 2 false negatives mean it misidentified 2 top posts as controversial.

For this particular research question, I would say this degree of misclassification is tolerable: put into perspective, it is quite rare for a post to be misclassified. In reality, this kind of event does happen with social media platforms’ algorithms, and it is considered a tolerable mistake.

Other metrics, such as f_meas - the harmonic mean that balances precision and recall - come out at 0.9949917, which is very close to 1, meaning the model classifies top and controversial posts with high accuracy.

conf_mat_testing <- random_forest_model %>% 
  augment(testing_data) %>%
  conf_mat(truth = ranking_type, .pred_class) 

conf_mat_testing %>%
  summary()
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.995
 2 kap                  binary         0.989
 3 sens                 binary         0.993
 4 spec                 binary         0.996
 5 ppv                  binary         0.997
 6 npv                  binary         0.992
 7 mcc                  binary         0.989
 8 j_index              binary         0.989
 9 bal_accuracy         binary         0.995
10 detection_prevalence binary         0.548
11 precision            binary         0.997
12 recall               binary         0.993
13 f_meas               binary         0.995
conf_mat_testing %>%
  autoplot(type = "heatmap")
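
The f_meas value in the table can be recovered directly from the precision and recall rows; a quick sanity check using the rounded values reported above:

```r
# precision and recall as rounded in the summary table above
precision <- 0.997
recall    <- 0.993

# f_meas is the harmonic mean of precision and recall
f_meas <- 2 * precision * recall / (precision + recall)
round(f_meas, 3)  # 0.995, matching the f_meas row in the table
```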

  3. Visualizing the importance of each feature in the random forest model for predicting the target variable, using the vip() function to generate a variable importance plot.

This plot usefully shows that upvote_ratio is the most important feature in our random forest model for predicting whether a post is top or controversial.

random_forest_fit <- random_forest_model %>% extract_fit_engine()
random_forest_fit %>%
  vip(geom = "col", aesthetics = list(fill = "midnightblue", alpha = 0.8)) +
  scale_y_continuous(expand = c(0, 0))

  4. Creating a ROC plot to evaluate the random forest model’s performance on the testing data.

As you can see from the plot, the ROC curve sits very close to the top-left corner, meaning the model is highly accurate and robust. The ROC_AUC value is 0.9999661, again confirming that the model predicts whether a post is top or controversial with high accuracy.

  random_forest_model %>% 
  augment(testing_data) %>%
  roc_auc(truth = ranking_type, .pred_1, event_level="second") 
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary          1.00
  random_forest_model %>% 
  augment(testing_data) %>%
  roc_curve(truth=ranking_type, .pred_1, event_level="second") %>%
  autoplot() + 
  geom_point(aes(x=1-specificity, y=sensitivity, color=.threshold)) + 
  scale_color_gradient(name="Threshold", low = "#c6733c", high="#3cc6b8", limits=c(0, 1)) + 
  labs(title="Meet the ROC curve",
       subtitle="This random forest model performs really well!", 
       x="(1 - specificity) = 1 - TN/N",
       y="(sensitivity) = TP/P")

A word on overfitting

I believe the random forest model may well be suffering from overfitting:

  • The model did not perform significantly better on the training data than on the testing data, which, by common understanding, is one indicator that a model is not overfitted.

  • However, the model scored perfectly on the training set, the dataset is quite small, and I used a moderate number of features to train the random forest, so the model could very much be overfitted to the training data.
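
One way to probe overfitting further would be k-fold cross-validation on the training data; a sketch using tidymodels, reusing wflow_forest and training_data from above (the fold count of 5 is my choice, not from the original analysis):

```r
library(tidymodels)

set.seed(123)
# 5 stratified folds of the training data
folds <- vfold_cv(training_data, v = 5, strata = ranking_type)

# refit the same workflow on each fold and average the held-out metrics;
# a large gap versus the training-set metrics would signal overfitting
cv_results <- fit_resamples(
  wflow_forest,
  resamples = folds,
  metrics   = metric_set(roc_auc, f_meas)
)

collect_metrics(cv_results)
```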

Part 2: Similarity

A lot of the code I use below is similar to Part 1, so I won’t explain it in detail unless necessary.

Feature Engineering

  1. Pre-processing the post titles and merging them together with comments

I plan to use reddit posts’ titles and the comments they received as features used to calculate their similarities.

post_title_df <- data.frame(post_id = post_data_string$post_id, title = post_data_string$title)

df_combined <- df_cleaned %>%
  left_join(post_title_df, by = "post_id") %>%
  mutate(full_text = str_c(title, " ", combined_body)) %>%
  select(post_id, full_text)
  2. Removing duplicates
df_combined <- df_combined %>%
  distinct(post_id, .keep_all = TRUE)

Generating the TF-IDF Matrix

  1. Text analysis
  2. Reducing the size of the DFM to include only the tokens that appear in at least 5% of the documents.
min_doc_freq <- 0.05 * ndoc(dfm_speeches1)  
dfm_trimmed <- dfm_trim(dfm_speeches1, min_docfreq = min_doc_freq) 
  3. Computing the TF-IDF matrix from the trimmed document-feature matrix

The reason the TF-IDF matrix is calculated here is that TF-IDF emphasizes the more important words in a post while down-weighting the less informative ones. TF-IDF also normalizes for document length.

tfidf <- dfm_tfidf(dfm_trimmed)
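
The “text analysis” step above reuses the Part 1 pipeline but its code is not shown; a minimal sketch of how dfm_speeches1 could be built from df_combined (my reconstruction - the exact tokenization options used in the original are assumptions, and the data here is a toy stand-in):

```r
library(quanteda)

# toy stand-in for df_combined; the real one holds post_id and full_text
df_combined <- data.frame(
  post_id   = c("p1", "p2"),
  full_text = c("A controversial post about games.",
                "Comments about the game and more games.")
)

# corpus from the combined title + comment text, one document per post
corp_full <- corpus(df_combined, text_field = "full_text")
docnames(corp_full) <- df_combined$post_id

# tokenize, drop punctuation and English stopwords, then build the DFM
dfm_speeches1 <- corp_full %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(pattern = stopwords("en")) %>%
  dfm()

ndoc(dfm_speeches1)  # one row per post
```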

Calculate Cosine Similarity

  1. Calculating cosine similarity based on the TF-IDF matrix.

The rationale behind using cosine similarity is straightforward. Cosine similarity is well suited to textual data, which is what we are using here, because it captures the semantic similarity of Reddit posts. Euclidean distance is less ideal for textual data: text data tends to be high-dimensional, and Euclidean distance does not behave well between points in high-dimensional space.

similarity_matrix <- textstat_simil(tfidf, method = "cosine")
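
For intuition, the cosine similarity that textstat_simil() computes between two row vectors is their dot product divided by the product of their norms; a minimal base-R illustration with made-up term-weight vectors:

```r
# cosine similarity of two numeric vectors
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# two made-up TF-IDF-style weight vectors, for illustration only
a <- c(1, 0, 2, 0)
b <- c(1, 1, 0, 0)

cosine_sim(a, a)  # 1: a document is always identical to itself
cosine_sim(a, b)  # strictly between 0 and 1 for non-negative weights
```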
  2. Converting the similarity matrix to a dataframe for visualization
similarity_df <- as.data.frame(as.matrix(similarity_matrix))
  3. Generating a heatmap visualization of the cosine similarities between Reddit posts

Before making the heatmap, I decided to visualize only a subset of the similarity matrix, because the original dimension of the similarity matrix is 1813 x 1813, which is too big. Therefore, I am only visualizing the cosine similarity values between the first 3 Reddit posts.

subset_df <- similarity_df[1:3, 1:3]
head(subset_df)
           123umlm    123xt60    123y5dz
123umlm 1.00000000 0.12687007 0.07998254
123xt60 0.12687007 1.00000000 0.08206813
123y5dz 0.07998254 0.08206813 1.00000000
  4. Converting the subset of cosine similarities into long format

I could not find a better way to convert the above subset of cosine similarity values between 3 Reddit posts into the long format required for visualization, so I constructed the dataframe manually.

cosine_similarities <- data.frame(
  Text1 = c("Reddit_Post1", "Reddit_Post1", "Reddit_Post1", "Reddit_Post2", "Reddit_Post3", "Reddit_Post3"),
  Text2 = c("Reddit_Post1", "Reddit_Post2", "Reddit_Post3", "Reddit_Post2", "Reddit_Post2", "Reddit_Post3"),
  Cosine_Similarity = c(1.00000000, 0.12783976, 0.08988606, 1.00000000, 0.09223651, 1.00000000)
)
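
For reference, the same long format can also be produced programmatically rather than by hand; a base-R sketch (using a toy 3x3 similarity matrix in place of the real subset_df):

```r
# toy stand-in for subset_df: a symmetric 3x3 cosine-similarity matrix
subset_df <- as.data.frame(matrix(
  c(1.00, 0.13, 0.08,
    0.13, 1.00, 0.08,
    0.08, 0.08, 1.00),
  nrow = 3,
  dimnames = list(paste0("Reddit_Post", 1:3), paste0("Reddit_Post", 1:3))
))

# as.table() on a matrix melts it into (row, column, value) triples
cosine_similarities <- as.data.frame(as.table(as.matrix(subset_df)))
names(cosine_similarities) <- c("Text1", "Text2", "Cosine_Similarity")
head(cosine_similarities)
```

This yields one row per pair of posts (9 rows for a 3x3 matrix), ready for geom_tile().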
  5. Plotting the heatmap

The heatmap conveniently visualizes the similarity between each pair of Reddit posts. As one can observe from the cosine similarity legend, the black blocks represent 1, because a post is identical to itself. Reddit_Post1 appears slightly similar to Reddit_Post2, given the slightly darker colour.

ggplot(cosine_similarities, aes(x = Text1, y = Text2, fill = Cosine_Similarity)) +
  geom_tile() +
  scale_fill_gradient(low = "pink", high = "black", name = "Cosine Similarity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Reddit Posts", y = "Reddit Posts", title = "Reddit Post Cosine Similarity Heatmap")

Part 3

Research Question

The research question is: what are the most distinctive topics that appear in controversial posts?

Motivation: online interactions constitute an important dimension of our lives. Reddit is an online discussion platform used by many people, including me, to share their opinions and interact with others online. However, it is very common, especially on Reddit, for users to start arguments and debates with one another on controversial topics. Reddit is even known as a platform where people express their opinions to a degree that becomes aggressive or disrespectful. Therefore, I believe it is imperative for the management of the Reddit platform to try to ensure that the Reddit environment is a safe place for discussion. I believe the research question I plan to tackle will be especially fruitful because it seeks to find out what distinctive controversial topics Reddit users are posting about. Through my findings, the management of Reddit can regulate the environment better, for example by moderating the use of certain words related to the topics I found.

Feature engineering

  1. Building a dataframe with only the Reddit posts’ titles, their post_id and ranking_type, and filtering to keep only the controversial posts
contro_posts <- post_data_string[post_data_string$ranking_type == 1, c("post_id", "title", "ranking_type")]
  2. Removing the ranking_type column
post_title_df <- contro_posts[, c("post_id", "title")]

Text analysis on the titles

  1. Since I already gave a detailed explanation of each of these text analysis steps in Part 1, I don’t think it’s necessary to repeat myself. The code here tokenizes the titles, removes punctuation and stopwords, and turns the titles into DFM format.
  2. Converting the DFM of the titles into a dataframe for later use in the anomaly detection section. The output shows the dimensions of the resulting dataframe.
dfm_as_data_frame <- quanteda::convert(dfm_speeches2, to="data.frame")
dim(dfm_as_data_frame)
[1]  1000 16596
  3. Performing LSA on the DFM of titles to reduce its dimensionality.

Again, the LSA dimension reduction technique is used here instead of PCA because LSA is better suited to text data; performing PCA on this text data could not be done on my laptop.

df_lsa2 <- quanteda.textmodels::textmodel_lsa(dfm_speeches2 %>% dfm_tfidf(), nd=3)$docs %>% as.data.frame()

Anomaly Detection

The reason I have included an anomaly detection step here is that when I examined the summary statistics of the three components (V1, V2, V3) produced by LSA, I noticed that their minimum and maximum values differ significantly from the inter-quartile range. Therefore, I thought it best to examine what exactly those anomalies are and remove them.

  1. Using the plot_ly() function, I can produce a 3D scatterplot that visualizes the data points, allowing me to visually identify any significant deviations from the overall data pattern.

From this plot, I can identify a few very obvious outliers from the main cluster of data points. Since each data point is labelled with its corresponding post_id, I can easily identify them.

plot_ly(data =  bind_cols(df_lsa2, dfm_as_data_frame), 
        x = ~V1, 
        y = ~V2, 
        z = ~V3,
        size = 3,
        alpha = 0.7,
        type="scatter3d", 
        mode="markers", 
        text=~paste('doc_id:', doc_id))
  2. Filtering the outliers and looking at their titles, which provides useful insight into the outlier titles we have found.
post_title_df %>% 
  filter(post_id %in% c("15yvnc9", "17ptecj","17hgbf3", "190pyju")) %>%
  pull(title) %>% 
  print()
[1] "Killing Gaza (2018) - Jewish journalists Dan Cohen, Max Blumenthal visited and lived Gaza over a period 3 years by staying with a Palestinian family. This is a chilling documentation of how israelis came to loathe Arabs and the war crimes committed by the israeli military they Witnessed [01:36:49]"      
[2] "TIL, the sun in our solar system, Sol, is not a member of any constellation. Constellations are categorized by their shape as perceived from our solar system. Therefore, for Sol to be a member of a constellation we would need to travel to another star system to perceived the new \"shape\"."              
[3] "TIL that following Zimbabwe's narrow win over Pakistan in 2022 ICC T20 Cricket World Cup, Zimbabwean fans reminded Pakistan of having sent comedian Asif Muhammad, a doppelganger of Mr. Bean, in 2016. Zimbabwe President Emmerson Mnangagwa also asked Pakistan to send the real \"Mr Bean\" to Zimbabwe."     
[4] "1948: Creation &amp; Catastrophe (2023) - History shattering doc showing first-hand survivor accounts of terrorism in Mandate of Palestine. Rape infront of family members recounted by one of the survivors. This was the last chance to document creation of a state and the expulsion of a nation. [01:25:30]"
  3. Eliminating the outliers from the dataset.
post_title_df_clean <- post_title_df %>%
  filter(!post_id %in% c("15yvnc9", "17ptecj","17hgbf3", "190pyju"))

Finding the best clusters

Before finding the best number of clusters as part of answering my research question, I need to run the LSA process again, because I now have a new dataset rid of anomalies.

  1. Running the text analysis again on the anomaly-free dataset
  2. Running the LSA dimension reduction process again on the anomaly-free dataset.
df_lsa_cleaned_title <- quanteda.textmodels::textmodel_lsa(dfm_speeches3 %>% dfm_tfidf(), nd=3)$docs %>% as.data.frame()
  3. Finding the best number of clusters.

Based on the numerical text data produced by the LSA analysis, this step computes the optimal number of clusters for the text data. The code concludes that the best number of clusters is 3; in other words, the titles of Reddit posts are best separated into three clusters based on their patterns.

res.nbclust <- df_lsa_cleaned_title %>% select(V1,V2,V3) %>%
    scale() %>% 
    NbClust(distance = "euclidean",
            min.nc = 2, max.nc = 10, 
            method = "complete", index ="all") 
Warning in pf(beale, pp, df2): NaNs produced

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 5 proposed 2 as the best number of clusters 
* 9 proposed 3 as the best number of clusters 
* 4 proposed 4 as the best number of clusters 
* 1 proposed 6 as the best number of clusters 
* 4 proposed 8 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  3 
 
 
******************************************************************* 

Topic Modelling/Extraction

  1. Performing topic modelling on the DFM data using the LDA technique

This step identifies the topics present in Reddit post titles. Since the previous step identified the optimal number of clusters as 3, this step shows the top 10 terms in each of the three topics. The LDA technique is used here instead of K-means (another clustering method) because we are dealing with textual data, and LDA is better at capturing latent topics in a body of text. To be honest, I did try clustering the title text data using k-means, and the resulting clusters were extremely imbalanced: one cluster had one row and another had 900 rows. This can be attributed to the fact that K-means does not cope well with very dense clusters, and as the 3D scatter plot above shows, the title text data is very dense.

tmod_lda <- topicmodels::LDA(dfm_speeches2 %>% dfm_subset(ntoken(dfm_speeches2) > 0), k = 3)
tmod_lda %>% topicmodels::terms(10)
      Topic 1  Topic 2       Topic 3  
 [1,] "gaza"   "john"        "gaza"   
 [2,] "people" "oliver"      "israeli"
 [3,] "new"    "john_oliver" "oc"     
 [4,] "israel" "gaza"        "one"    
 [5,] "like"   "oc"          "people" 
 [6,] "|"      "just"        "new"    
 [7,] "men"    "people"      "says"   
 [8,] "john"   "like"        "games"  
 [9,] "2023"   "israel"      "first"  
[10,] "00"     "says"        "war"    
  2. Renaming the column names in preparation for the graph below.
new_names <- c("docname","title")
names(post_title_df_clean) <- new_names
  3. Extracting the topics produced by the LDA analysis and integrating them with the dataframe containing Reddit post titles.
df_topics <- tmod_lda %>% topicmodels::topics() %>% as.data.frame()
df_topics <- tibble::rownames_to_column(df_topics, "doc_id")
colnames(df_topics) <- c("docname", "topic")
df_topics$topic <- as.factor(df_topics$topic)

post_title_df_clean <- left_join(post_title_df_clean, df_topics, by="docname")

Interpreting Results

Important Disclaimer - Must Read: I realized that every time I run the code in this section from top to bottom, the wordcloud changes! Therefore, the key topics and the explanations I wrote in the following section may not align with the wordcloud, because they were based on the first run of the code.

The first distinctive topic

  1. Using the textstat_keyness() function to calculate keyness statistics for the most common tokens in the cluster 1 controversial Reddit posts.
selected_cluster = 1
tstat_key <- textstat_keyness(dfm_speeches3, 
                              measure="chi", 
                              target = case_when(is.na(post_title_df_clean$topic) ~ FALSE, 
                                                 post_title_df_clean$topic == selected_cluster ~ TRUE,
                                                 .default = FALSE))
textplot_keyness(tstat_key, labelsize=2)

This graph compares the keyness statistics of the cluster 1 controversial posts with those of all other controversial posts. The advantage of this comparison lies in its ability to identify the topics most salient in these controversial Reddit posts but less common in the other two clusters.

As one can observe from the wordcloud below, the most common words in cluster 1 are “just”, “like”, “game”, “women”. It is a little hard to conclude a single salient topic from these words. However, contextualizing them with my domain knowledge, I would argue that in cluster 1 the most distinctive controversial Reddit posts concern “women” and “games”. This makes a lot of sense, because Reddit is quite a male-dominated platform and many users hold very different opinions on “games” and “women”.

textplot_wordcloud(tstat_key, comparison=TRUE, min_count=2)

The second distinctive topic

From here to the third distinctive topic, the code is the same.

selected_cluster = 2
tstat_key <- textstat_keyness(dfm_speeches3, 
                              measure="chi", 
                              target = case_when(is.na(post_title_df_clean$topic) ~ FALSE, 
                                                 post_title_df_clean$topic == selected_cluster ~ TRUE,
                                                 .default = FALSE))
textplot_keyness(tstat_key, labelsize=2)

The most common words in cluster 2 are “gaza”, “israeli”, “2023”, “palestinian”. Contextualizing these words, one can conclude that the second most distinctive topic in controversial Reddit posts is the “Israeli-Palestinian war in 2023”. I think this topic was almost inevitable, because the data covers the past year (mid-March 2023 to mid-March 2024), which is exactly when the war broke out. Obviously, different groups of users have different opinions on the war, making it very controversial and widely talked about.

textplot_wordcloud(tstat_key, comparison=TRUE, min_count=2)

The third distinctive topic

selected_cluster = 3
tstat_key <- textstat_keyness(dfm_speeches3, 
                              measure="chi", 
                              target = case_when(is.na(post_title_df_clean$topic) ~ FALSE, 
                                                 post_title_df_clean$topic == selected_cluster ~ TRUE,
                                                 .default = FALSE))
textplot_keyness(tstat_key, labelsize=2)

The most common words in cluster 3 are “john”, “john_oliver”, “oliver”. One can straightforwardly conclude that the final most distinctive topic in controversial Reddit posts is “John Oliver”. I had no idea who John Oliver is, and after some googling, I found out that he is a British comedian. I didn’t know before that comedians could be controversial too!

textplot_wordcloud(tstat_key, comparison=TRUE, min_count=2)

Conclusion

The research question - “what are the most distinctive topics that appear in controversial posts?” - can now be answered. The topics are:

1. “women” and “game”

2. “Israeli-Palestinian war in 2023”

3. “john oliver”

This finding is not only useful for Reddit moderators in understanding the topics from which controversy and disagreement stem on the platform; it also reflects the wider social attitudes of Reddit users in 2023-2024 and how divided their opinions are on these topics.